354 research outputs found

    Inferring ancestral sequences in taxon-rich phylogenies

    Statistical consistency in phylogenetics has traditionally referred to the accuracy of estimating phylogenetic parameters for a fixed number of species as the number of characters increases. However, because sequences are often of fixed length (e.g. for a gene) whereas more taxa can often be sampled, it is useful to consider a dual type of statistical consistency in which the number of species increases, rather than the number of characters. This raises some basic questions: what can we learn about the evolutionary process as we increase the number of species? In particular, does having more species allow us to infer the ancestral state of characters accurately? This question is particularly relevant when sequence site evolution varies in a complex way from character to character, as well as for reconstructing ancestral sequences. In this paper, we assemble a collection of results to analyse various approaches for inferring ancestral information with increasing accuracy as the number of taxa increases.
    Comment: 32 pages, 5 figures, 1 table

    Fast NJ-like algorithms to deal with incomplete distance matrices

    Rights: This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license, which is similar to the 'Creative Commons Attribution Licence'. In brief you may: copy, distribute, and display the work; make derivative works; or make commercial use of the work, under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are.
    Abstract
    Background: Distance-based phylogeny inference methods first estimate evolutionary distances between every pair of taxa, then build a tree from the resulting distance matrix. These methods are fast and fairly accurate. However, they cope poorly with incomplete distance matrices. Such matrices are frequent in recent multi-gene studies, where two species may not share any gene in the analyzed data. The few existing algorithms that infer trees with satisfactory accuracy from incomplete distance matrices have time complexity in O(n^4) or more, where n is the number of taxa, which precludes large scale studies. Agglomerative distance algorithms (e.g. NJ [1, 2]) are much faster, with time complexity in O(n^3), which allows huge datasets and heavy bootstrap analyses to be dealt with. These algorithms proceed in three steps: (a) search for the taxon pair to be agglomerated, (b) estimate the lengths of the two newly created branches, (c) reduce the distance matrix and return to (a) until the tree is fully resolved. But available agglomerative algorithms cannot deal with incomplete matrices.
    Results: We propose an adaptation to incomplete matrices of three agglomerative algorithms, namely NJ, BIONJ [3] and MVR [4]. Our adaptation generalizes to incomplete matrices the taxon pair selection criterion of NJ (also used by BIONJ and MVR), and combines this generalized criterion with that of ADDTREE [5]. Steps (b) and (c) are also modified, but O(n^3) time complexity is kept. The performance of these new algorithms is studied with large scale simulations, which mimic multi-gene phylogenomic datasets. Our new algorithms, named NJ*, BIONJ* and MVR*, infer phylogenetic trees that are at least as accurate as those inferred by other available methods, but with much faster running times. MVR* presents the best overall performance. This algorithm accounts for the variance of the pairwise evolutionary distance estimates, and is well suited for multi-gene studies where some distances are accurately estimated using numerous genes, whereas others are poorly estimated (or not estimated) due to the low number (absence) of sequenced genes shared by both species.
    Conclusion: Our distance-based agglomerative algorithms NJ*, BIONJ* and MVR* are fast and accurate, and should be quite useful for large scale phylogenomic studies. When combined with the SDM method [6] to estimate a distance matrix from multiple genes, they offer a relevant alternative to usual supertree techniques [7]. Binaries and all simulated data are downloadable from [8].
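
The three agglomerative steps (a), (b) and (c) named in the abstract can be illustrated with a minimal sketch of classical NJ on a complete distance matrix. This is not NJ*, BIONJ* or MVR*: the handling of missing entries is not described here and is not attempted. The function name `neighbor_joining`, the dict-of-dicts matrix layout and the internal node labels `u0`, `u1`, ... are assumptions made for illustration.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Classical NJ on a complete symmetric matrix given as dist[a][b].

    Returns a list of (node, neighbour, branch_length) edges.
    Illustrative sketch only; it does not handle missing distances.
    """
    d = {a: dict(row) for a, row in dist.items()}  # work on a copy
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        r = len(nodes)
        total = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # (a) choose the pair minimising Q(i, j) = (r - 2) d_ij - S_i - S_j
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (r - 2) * d[p[0]][p[1]] - total[p[0]] - total[p[1]],
        )
        # (b) estimate the lengths of the two branches joining i and j
        len_i = 0.5 * d[i][j] + (total[i] - total[j]) / (2 * (r - 2))
        len_j = d[i][j] - len_i
        u = f"u{next_id}"
        next_id += 1
        edges += [(i, u, len_i), (j, u, len_j)]
        # (c) reduce the matrix: replace i and j by the new internal node u
        d[u] = {}
        for k in nodes:
            if k not in (i, j):
                d[u][k] = d[k][u] = 0.5 * (d[i][k] + d[j][k] - d[i][j])
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))  # last remaining branch
    return edges
```

Each iteration scans all pairs, which is O(n^2) work repeated O(n) times, in line with the O(n^3) complexity quoted for agglomerative algorithms.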

    The combinatorics of overlapping genes

    Overlapping genes exist in all domains of life and are much more abundant than expected at their first discovery in the late 1970s. Assuming that the reference gene is read in frame +0, an overlapping gene can be encoded in two reading frames in the sense strand, denoted by +1 and +2, and in three reading frames in the opposite strand, denoted by -0, -1 and -2. This motivated numerous researchers to study the constraints induced by the genetic code on the various overlapping frames, mostly based on information theory. Our focus in this paper is on the constraints induced on two overlapping genes in terms of amino acids, as well as polypeptides. We show that simple linear constraints bind the amino acid composition of two proteins encoded by overlapping genes. Novel constraints are revealed when polypeptides are considered, and not just single amino acids. For example, in double-coding sequences with an overlapping reading frame -2, each Tyrosine (denoted as Tyr or Y) in the overlapping frame overlaps a Tyrosine in the reference frame +0 (and reciprocally), whereas specific words (e.g. YY) never occur. We thus distinguish between null constraints (YY = 0 in frame -2) and non-null constraints (Y in frame +0 ⇔ Y in frame -2). Our equivalence-based constraints are symmetrical and thus enable the characterization of the joint composition of overlapping proteins. We describe several formal frameworks and a graph algorithm to characterize and compute these constraints. These results yield support for understanding the mechanisms and evolution of overlapping genes, and for developing novel overlapping gene detection methods.

    A 'stochastic safety radius' for distance-based tree reconstruction

    A variety of algorithms have been proposed for reconstructing trees that show the evolutionary relationships between species by comparing differences in genetic data across present-day taxa. If the leaf-to-leaf distances in a tree can be accurately estimated, then it is possible to reconstruct this tree from these estimated distances, using polynomial-time methods such as the popular 'Neighbor-Joining' algorithm. There is a precise combinatorial condition under which distance-based methods are guaranteed to return a correct tree (in full or in part) based on the requirement that the input distances all lie within some 'safety radius' of the true distances. Here, we explore a stochastic analogue of this condition, and mathematically establish upper and lower bounds on this 'stochastic safety radius' for distance-based tree reconstruction methods. Using simulations, we show how this notion provides a new way to compare the performance of distance-based tree reconstruction methods. This may help explain why Neighbor-Joining performs so well, as its stochastic safety radius appears close to optimal (while its more classical safety radius is the same as many other less accurate methods).
    Comment: 18 pages, 1 figure, 4 tables

    Inferring evolutionary trees with strong combinatorial evidence

    We consider the problem of inferring the evolutionary tree of a set of n species. We propose a quartet reconstruction method which specifically produces trees whose edges have strong combinatorial evidence. Let Q be a set of resolved quartets defined on the studied species. The method computes the unique maximum subset Q* of Q which is equivalent to a tree and outputs the corresponding tree as an estimate of the species' phylogeny. We use a characterization of the subset Q* due to (Bandelt86) to provide an O(n^4) incremental algorithm for this variant of the NP-hard quartet consistency problem. Moreover, when choosing the resolution of the quartets by the Four-Point Method (FPM) and considering the Cavender-Farris model of evolution, we show that the convergence rate of the Q* method is at worst polynomial when the maximum evolutionary distance between two species is bounded. We complete these theoretical results by an experimental study on real and simulated data sets. The results show that (i) as expected, the strong combinatorial constraints it imposes on each edge lead the Q* method to propose very few incorrect edges; (ii) more surprisingly, the method infers trees with a relatively high degree of resolution.
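
The Four-Point Method mentioned above can be sketched as follows, in its standard textbook form: among the three possible resolutions of a quartet, keep the one whose two within-pair distances have the smallest sum. This is a minimal illustration and not the Q* method itself; the function name `four_point` and the dict-of-dicts distance layout are assumptions.

```python
def four_point(dist, a, b, c, e):
    """Resolve the quartet {a, b, c, e} from pairwise distances dist[x][y].

    Returns the chosen resolution as a pair of pairs, e.g. ((a, b), (c, e))
    for the topology ab|ce. Ties are broken by the listing order below.
    """
    sums = {
        ((a, b), (c, e)): dist[a][b] + dist[c][e],
        ((a, c), (b, e)): dist[a][c] + dist[b][e],
        ((a, e), (b, c)): dist[a][e] + dist[b][c],
    }
    # On a tree metric the correct resolution gives the strictly smallest
    # sum whenever the internal branch has positive length.
    return min(sums, key=sums.get)
```

Applying this to every four-taxon subset yields a set Q of resolved quartets of the kind the abstract takes as input, with O(n^4) quartets for n species.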

    Deep conservation of human protein tandem repeats within the eukaryotes

    Tandem repeats (TRs) are a major element of protein sequences in all domains of life. They are particularly abundant in mammals, where by conservative estimates one in three proteins contains a TR. High generation-scale duplication and deletion rates were reported for nucleic TR units. However, it is not known whether protein TR units can also be frequently lost or gained, providing a source of variation for rapid adaptation of protein function, or alternatively tend to have conserved TR unit configurations over long evolutionary times. To obtain a systematic picture for protein TRs, we performed a proteome-wide analysis of the mode of evolution for human TRs. For this purpose, we propose a novel method for the detection of orthologous TRs based on circular profile hidden Markov models. For all detected TRs we reconstructed bi-species TR unit phylogenies across 61 eukaryotes ranging from human to yeast. Moreover, we performed additional analyses to correlate functional and structural annotations of human TRs with their mode of evolution. Surprisingly, we find that the vast majority of human TRs are ancient, with TR unit number and order preserved intact since distant speciation events. For example, ≥61% of all human TRs have been strongly conserved at least since the root of all mammals, approximately 300 Mya. Further, we find no human protein TR that shows evidence for strong recent duplications and deletions. The results are in contrast to high generation-scale mutability of nucleic TRs. Presumably, most protein TRs fold into stable and conserved structures that are indispensable for the function of the TR-containing protein. All of our data and results are available for download from http://www.atgc-montpellier.fr/TRE

    Les espaces de l'halieutique

    This article presents a spatially explicit, environment-forced model of the Atlantic yellowfin tuna population. The model relies on nonlinear relationships, estimated by generalized additive modelling (GAM), that characterize on the one hand the environmental preferences of yellowfin tuna and on the other hand their catchability by different fishing gears. Formulated analytically, the relationships describing the environmental preferences of yellowfin tuna are used to force an advection-diffusion-reaction model of the population. The relationships describing catchability by different gears, also formulated analytically, make it possible to fit the model to observed catches. The model simulates the distribution of the animals as a function of the oceanic environment and of actual catches. Through various simulations, we examine the phenomenon of local overexploitation of adult tuna in the Gulf of Guinea. The very large magnitude of the phenomenon observed in the simulations is discussed. (Author's abstract)

    Rapidly Computing the Phylogenetic Transfer Index

    Given trees T and T_o on the same taxon set, the transfer index phi(b,T_o) is the number of taxa that need to be ignored so that the bipartition induced by branch b in T is equal to some bipartition in T_o. Recently, Lemoine et al. [Lemoine et al., 2018] used the transfer index to design a novel bootstrap analysis technique that improves on Felsenstein's bootstrap on large, noisy data sets. In this work, we propose an algorithm that computes the transfer index for all branches b in T in O(n log^3 n) time, which improves upon the current O(n^2)-time algorithm by Lin, Rajan and Moret [Lin et al., 2012]. Our implementation is able to process pairs of trees with hundreds of thousands of taxa in minutes and considerably speeds up the method of Lemoine et al. on large data sets. We believe our algorithm can be useful for comparing large phylogenies, especially when some taxa are misplaced (e.g. due to horizontal gene transfer, recombination, or reconstruction errors).
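
The definition of phi(b,T_o) above can be made concrete with a brute-force sketch. It is neither the O(n log^3 n) algorithm proposed in the paper nor the O(n^2) algorithm it improves upon: it simply compares one side of a bipartition of T with every bipartition of T_o. The nested-tuple tree encoding, the helper names `clades` and `transfer_index`, and the choice to include the single-leaf bipartitions of T_o are all assumptions made for illustration.

```python
def clades(tree):
    """Leaf sets below every non-root node of a nested-tuple tree.

    Leaves are any non-tuple labels; internal nodes are tuples of children.
    Each returned set is one side of the bipartition of the branch above it.
    """
    found = []

    def walk(node):
        if isinstance(node, tuple):
            leaves = frozenset().union(*(walk(child) for child in node))
        else:
            leaves = frozenset([node])
        found.append(leaves)
        return leaves

    all_taxa = walk(tree)
    return [c for c in found if len(c) < len(all_taxa)]


def transfer_index(side, reference_clades, n):
    """Minimum number of taxa to ignore so that the bipartition with one
    side equal to `side` matches some bipartition of the reference tree."""
    best = n
    for clade in reference_clades:
        diff = len(side ^ clade)             # mismatches against this side
        best = min(best, diff, n - diff)     # or against the complement
    return best
```

For example, with T = (((A,B),C),(D,E)) and T_o = (((A,C),B),(D,E)), the branch above {A,B} in T needs one taxon ignored, while the branches above {A,B,C} and {D,E} are already present in T_o.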